Abstract


We propose a novel parameter estimation procedure that works efficiently for conditional random fields (CRF). This algorithm is an extension of maximum likelihood estimation (MLE), using loss functions defined by Bregman divergences which measure the proximity between the model expectation and the empirical mean of the feature vectors. This leads to a flexible training framework from which multiple update strategies can be derived using natural gradient descent (NGD). We carefully choose the convex function inducing the Bregman divergence so that the types of updates are reduced, while making the optimization procedure more effective by transforming the gradients of the log-likelihood loss function. The derived algorithms are very simple and can be easily implemented on top of the existing stochastic gradient descent (SGD) optimization procedure, yet they are very effective, as illustrated by experimental results.


1 Introduction 


Graphical models are used extensively to solve NLP problems. One of the most popular models is the conditional random field (CRF) (Lafferty, 2001), which has been successfully applied to tasks like shallow parsing (Fei Sha and Fernando Pereira, 2003), named entity recognition (Andrew McCallum and Wei Li, 2003), and word segmentation (Fuchun Peng et al., 2004), just to name a few.

While the modeling power demonstrated by CRF is critical for performance improvement, accurate parameter estimation of the model is equally important. As a general structured prediction problem, multiple training methods for CRF have been proposed corresponding to different choices of loss functions. For example, one of the common training approaches is maximum likelihood estimation (MLE), whose loss function is the (negative) log-likelihood of the training data and can be optimized by algorithms like L-BFGS (Dong C. Liu and Jorge Nocedal, 1989), stochastic gradient descent (SGD) (Bottou, 1998), stochastic meta-descent (SMD) (Schraudolph, 1999) (Vishwanathan, 2006), etc. If the structured hinge-loss is chosen instead, then the Passive-Aggressive (PA) algorithm (Crammer et al., 2006), the structured Perceptron (Collins, 2002) (corresponding to a hinge-loss with zero margin), etc. can be applied for learning.


In this paper, we propose a novel CRF training procedure. Our loss functions are defined by the Bregman divergence (Bregman, 1967) between the model expectation and the empirical mean of the feature vectors, and can be treated as a generalization of the log-likelihood loss. We then use natural gradient descent (NGD) (Amari, 1998) to optimize the loss. Since for large-scale training, stochastic optimization is usually a better choice than batch optimization (Bottou, 2008), we focus on the stochastic version of the algorithms. The proposed framework is very flexible, allowing us to choose proper convex functions inducing the Bregman divergences that lead to better training performance.


In Section 2 we briefly review some background material relevant to further discussion; Section 3 gives a step-by-step introduction to the proposed algorithms; experimental results are given in Section 4, followed by discussion and conclusions.












































2 Background 

2.1 MLE for Graphical Models 


Graphical models can be naturally viewed as exponential families (Wainwright and Jordan, 2008). For example, for a data sample $(x, y)$ where $x$ is the input sequence and $y$ is the label sequence, the conditional distribution $p(y|x)$ modeled by CRF can be written as

$$p_\theta(y|x) = \exp\{\theta \cdot \Phi(x, y) - A_\theta\}$$

where $\Phi(x, y) = \sum_{c \in C} \phi_c(x_c, y_c)$ is the feature vector (of dimension $d$) collected from all factors $C$ of the graph, and $A_\theta$ is the log-partition function.

MLE is commonly used to estimate the model parameters $\theta$. The gradient of the log-likelihood of the training data is given by

$$\tilde{E} - E_\theta \qquad (1)$$

where $\tilde{E}$ and $E_\theta$ denote the empirical mean and the model expectation of the feature vectors respectively. The moment-matching condition $\tilde{E} = E_{\theta^*}$ holds when the maximum likelihood solution $\theta^*$ is found.
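To make Eq. 1 concrete, the following minimal sketch computes the log-likelihood gradient $\tilde{E} - E_\theta$ for a toy model with a single factor, so that the label space can be enumerated directly. The function names and the toy feature vectors are illustrative assumptions, not part of the original work; a real CRF would typically compute $E_\theta$ by dynamic programming rather than by enumeration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_partition(theta, feats):
    """A_theta = log sum_y exp(theta . Phi(x, y)), computed stably."""
    scores = [dot(theta, phi) for phi in feats]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def model_expectation(theta, feats):
    """E_theta: expected feature vector under p_theta(y | x)."""
    a = log_partition(theta, feats)
    probs = [math.exp(dot(theta, phi) - a) for phi in feats]
    return [sum(p * phi[j] for p, phi in zip(probs, feats))
            for j in range(len(theta))]

def loglik_gradient(theta, feats, gold):
    """Eq. 1 for one sample: empirical features minus model expectation."""
    e_model = model_expectation(theta, feats)
    return [feats[gold][j] - e_model[j] for j in range(len(theta))]

def log_likelihood(theta, feats, gold):
    return dot(theta, feats[gold]) - log_partition(theta, feats)

# Hypothetical toy problem: three candidate labels, two features.
FEATS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```

When the gradient returned by `loglik_gradient` is zero in every dimension, the moment-matching condition holds for that sample.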


2.2 Bregman Divergence

The Bregman divergence is a general proximity measure between two points $p, q$ (which can be scalars, vectors, or matrices). It is defined as

$$B_G(p, q) = G(p) - G(q) - \nabla G(q)^T (p - q)$$

where $G$ is a differentiable convex function inducing the divergence. Note that the Bregman divergence is in general asymmetric, namely $B_G(p, q) \neq B_G(q, p)$.

By choosing $G$ properly, many interesting distances or divergences can be recovered. For example, choosing $G(u) = \frac{1}{2}\|u\|^2$ gives $B_G(p, q) = \frac{1}{2}\|p - q\|^2$, and the (squared) Euclidean distance is recovered; choosing $G(u) = \sum_i u_i \log u_i$ gives $B_G(p, q) = \sum_i p_i \log \frac{p_i}{q_i}$, and the Kullback-Leibler divergence is recovered. A good review of the Bregman divergence and its properties can be found in (Banerjee et al., 2005).

2.3 Natural Gradient Descent

NGD is derived from the study of information geometry (Amari and Nagaoka, 2000), and is one of its most popular applications. The conventional gradient descent assumes the underlying parameter space to be Euclidean; nevertheless, this is often not the case (say, when the space is a statistical manifold). By contrast, NGD takes the geometry of the parameter space into consideration, giving the following update strategy for a loss function $\mathcal{L}$:

$$\theta_{t+1} = \theta_t - \lambda_t M_{\theta_t}^{-1} \nabla_\theta \mathcal{L}(\theta_t) \qquad (2)$$

where $\lambda_t$ is the learning rate and $M$ is the Riemannian metric of the parameter space manifold. When the parameter space is a statistical manifold, the common choice of $M$ is the Fisher information matrix, namely

$$M_\theta^{i,j} = \mathbb{E}\left[\frac{\partial \log p_\theta(x)}{\partial \theta_i} \frac{\partial \log p_\theta(x)}{\partial \theta_j}\right].$$

It is shown in (Amari, 1998) that NGD is asymptotically Fisher efficient and often converges faster than the normal gradient descent.

Despite the nice properties of NGD, it is in general difficult to implement due to the nontrivial computation of $M_\theta^{-1}$. To handle this problem, researchers have proposed techniques to simplify, approximate, or bypass the computation of $M_\theta^{-1}$, for example (Le Roux et al., 2007) (Honkela et al., 2008) (Pascanu and Bengio, 2013).


3 Algorithm

3.1 Loss Functions

The loss function defined for our training procedure is motivated by MLE. Since $E_\theta = \tilde{E}$ needs to hold when the solution to MLE is found, the "gap" between $E_\theta$ and $\tilde{E}$ (according to some measure) needs to be reduced as the training proceeds. We therefore define the Bregman divergence between $E_\theta$ and $\tilde{E}$ as our loss function. Since the Bregman divergence is in general asymmetric, we consider four types of loss functions as follows (named $B_1$-$B_4$; see footnote 1):

$$B_G(\gamma E_\theta, \tilde{E} - \rho E_\theta) \qquad (B_1)$$

$$B_G(\tilde{E} - \rho E_\theta, \gamma E_\theta) \qquad (B_2)$$

$$B_G(\rho \tilde{E}, E_\theta - \gamma \tilde{E}) \qquad (B_3)$$

$$B_G(E_\theta - \gamma \tilde{E}, \rho \tilde{E}) \qquad (B_4)$$

where $\gamma \in \mathbb{R}$ and $\rho = 1 - \gamma$. It can be seen that whenever the loss functions are minimized at point $\theta^*$ (the Bregman divergence reaches zero), $E_{\theta^*} = \tilde{E}$ and $\theta^*$ gives the same solution as MLE.

We are free to choose the hyper-parameter $\gamma$, which is possibly correlated with the performance of the algorithm. However, to simplify the problem setting, we will only focus on the cases where $\gamma = 0$ or $1$. Although it seems we now have eight versions of loss functions, it will be seen shortly that by properly choosing the convex function $G$, many of them are redundant and we end up having only two update strategies.

1 In the most general setting, we may use $B_G(aE_\theta - b\tilde{E}, cE_\theta - d\tilde{E})$ s.t. $a - b = c - d$, $a, b, c, d \in \mathbb{R}$ as the loss functions, which guarantees that $E_\theta = \tilde{E}$ holds at its minimum. However, this formulation brings too much design freedom, which complicates the problem, since we are free to choose the parameters $a, b, c$ as well as the convex function $G$. Therefore, we narrow down our scope to the four special cases of the loss functions $B_1$-$B_4$ given above.
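As an illustration of how the four losses are assembled from a Bregman divergence, here is a small sketch. The helper names and the two example choices of $G$ (half squared norm and negative entropy) are assumptions made for the example; the expectations are passed in as plain lists rather than computed from a model.

```python
import math

def bregman(G, grad_G, p, q):
    """B_G(p, q) = G(p) - G(q) - grad G(q)^T (p - q)."""
    return G(p) - G(q) - sum(g * (a - b) for g, a, b in zip(grad_G(q), p, q))

def half_sq_norm(u):
    return 0.5 * sum(x * x for x in u)

def grad_half_sq_norm(u):
    return list(u)

def neg_entropy(u):
    return sum(x * math.log(x) for x in u)

def grad_neg_entropy(u):
    return [math.log(x) + 1.0 for x in u]

def bregman_losses(G, grad_G, e_model, e_emp, gamma):
    """Return (B1, B2, B3, B4) for a given gamma, with rho = 1 - gamma."""
    rho = 1.0 - gamma
    g_model = [gamma * m for m in e_model]                          # gamma * E_theta
    emp_minus = [e - rho * m for e, m in zip(e_emp, e_model)]       # E~ - rho * E_theta
    r_emp = [rho * e for e in e_emp]                                # rho * E~
    model_minus = [m - gamma * e for m, e in zip(e_model, e_emp)]   # E_theta - gamma * E~
    return (bregman(G, grad_G, g_model, emp_minus),
            bregman(G, grad_G, emp_minus, g_model),
            bregman(G, grad_G, r_emp, model_minus),
            bregman(G, grad_G, model_minus, r_emp))
```

With either $\gamma = 0$ or $\gamma = 1$, all four values vanish when the two expectation vectors coincide, which is the moment-matching condition the losses are designed around.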


3.2 Applying Natural Gradient Descent

The gradients of loss functions $B_1$-$B_4$ with respect to $\theta$ are given in Table 1.

Loss    $\gamma$    Gradient wrt. $\theta$
$B_1$   1           $\nabla_\theta E_\theta \left[\nabla G(E_\theta) - \nabla G(\tilde{E})\right]$
$B_1$   0           $\nabla_\theta E_\theta \nabla^2 G(\tilde{E} - E_\theta)(E_\theta - \tilde{E})$
$B_2$   1           $\nabla_\theta E_\theta \nabla^2 G(E_\theta)(E_\theta - \tilde{E})$
$B_2$   0           $\nabla_\theta E_\theta \left[\nabla G(0) - \nabla G(\tilde{E} - E_\theta)\right]$
$B_3$   {0,1}       $\nabla_\theta E_\theta \nabla^2 G(E_\theta - \gamma\tilde{E})(E_\theta - \tilde{E})$
$B_4$   {0,1}       $\nabla_\theta E_\theta \left[\nabla G(E_\theta - \gamma\tilde{E}) - \nabla G(\rho\tilde{E})\right]$

Table 1: Gradients of loss functions $B_1$-$B_4$.

It is in general difficult to compute the gradients of the loss functions, as all of them contain the term $\nabla_\theta E_\theta$, whose computation is usually non-trivial. However, it turns out that the natural gradients of the loss functions can be handled easily. To see this, note that for distributions in the exponential family, $\nabla_\theta E_{\theta_t} = \nabla_\theta^2 A_{\theta_t} = M_{\theta_t}$, which is the Fisher information matrix. Therefore the NGD update (Eq. 2) becomes

$$\theta_{t+1} = \theta_t - \lambda_t (\nabla_\theta E_{\theta_t})^{-1} \nabla_\theta \mathcal{L}(\theta_t) \qquad (3)$$

Now if we plug, for example, $\nabla_\theta B_1$ with $\gamma = 0$ into Eq. 3, we have

$$\theta_{t+1} = \theta_t - \lambda_t (\nabla_\theta E_{\theta_t})^{-1} \nabla_\theta B_1 = \theta_t - \lambda_t \nabla^2 G(\tilde{E} - E_{\theta_t})(E_{\theta_t} - \tilde{E}) \qquad (4)$$

Thus the step of computing the Fisher information can be circumvented, making the optimization of our loss functions tractable (see footnote 2).

This trick applies to all gradients in Table 1, yielding multiple update strategies. Note that $\nabla_\theta B_1$ with $\gamma = 1$ is equivalent to $\nabla_\theta B_4$ with $\gamma = 0$, and $\nabla_\theta B_2$ with $\gamma = 1$ is equivalent to $\nabla_\theta B_3$ with $\gamma = 0$. By applying Eq. 3 to all unique gradients, the following types of updates are derived (named $U_1$-$U_6$):

$$\theta_{t+1} = \theta_t - \lambda_t \nabla^2 G(\tilde{E} - E_{\theta_t})(E_{\theta_t} - \tilde{E}) \qquad (U_1)$$

$$\theta_{t+1} = \theta_t - \lambda_t \left[\nabla G(0) - \nabla G(\tilde{E} - E_{\theta_t})\right] \qquad (U_2)$$

$$\theta_{t+1} = \theta_t - \lambda_t \nabla^2 G(E_{\theta_t} - \tilde{E})(E_{\theta_t} - \tilde{E}) \qquad (U_3)$$

$$\theta_{t+1} = \theta_t - \lambda_t \nabla^2 G(E_{\theta_t})(E_{\theta_t} - \tilde{E}) \qquad (U_4)$$

$$\theta_{t+1} = \theta_t - \lambda_t \left[\nabla G(E_{\theta_t} - \tilde{E}) - \nabla G(0)\right] \qquad (U_5)$$

$$\theta_{t+1} = \theta_t - \lambda_t \left[\nabla G(E_{\theta_t}) - \nabla G(\tilde{E})\right] \qquad (U_6)$$

2 In (Hoffman et al., 2013), a similar trick was also used. However, their technique was developed from a specific variational inference problem setting, whereas the proposed approach derived from Bregman divergence is more general.

3.3 Reducing the Types of Updates

Although the framework described so far is very flexible and multiple update strategies can be derived, it is not a good idea to try them all in turn for a given task. Therefore, it is necessary to reduce the types of updates and simplify the problem.

We first remove $U_4$ and $U_6$ from the list, since they can be recovered from $U_3$ and $U_5$ respectively by choosing $G'(u) = G(u + \tilde{E})$. To further reduce the update types, we impose the following constraints on the convex function $G$:

1. $G$ is symmetric: $G(u) = G(-u)$.

2. $\nabla G(u)$ is an element-wise function, namely $\nabla G(u)_i = g_i(u_i)$, $\forall i \in 1, \ldots, d$, where $g_i$ is a uni-variate scalar function.

For example, $G(u) = \frac{1}{2}\|u\|^2$ is a typical function satisfying the constraints, since $\frac{1}{2}\|u\|^2 = \frac{1}{2}\|-u\|^2$, and $\nabla G(u) = [u_1, \ldots, u_d]^T$ where $g_i(u) = u$, $\forall i \in 1, \ldots, d$. It is also worth mentioning that by choosing $G(u) = \frac{1}{2}\|u\|^2$, all updates $U_1$-$U_6$ become equivalent:

$$\theta_{t+1} = \theta_t - \lambda_t (E_{\theta_t} - \tilde{E})$$

which recovers the GD for the log-likelihood function.
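The claim that the quadratic choice of $G$ collapses all six updates into plain gradient descent can be checked numerically. The sketch below evaluates the six update directions (the quantities multiplied by $\lambda_t$ in $U_1$-$U_6$) for given expectation vectors; representing the Hessian by its diagonal is an assumption that is valid only when $\nabla G$ acts element-wise, and all names are illustrative.

```python
def update_directions(grad_G, hess_diag_G, e_model, e_emp):
    """Directions subtracted (after scaling by the learning rate) in U1-U6.

    hess_diag_G returns the diagonal of the Hessian, which is all that is
    needed when grad_G acts element-wise (an assumption of this sketch).
    """
    diff = [m - e for m, e in zip(e_model, e_emp)]   # E_theta - E~
    neg = [-x for x in diff]                         # E~ - E_theta
    zero = [0.0] * len(diff)

    def times(h, v):
        return [a * b for a, b in zip(h, v)]

    def minus(a, b):
        return [x - y for x, y in zip(a, b)]

    return {
        "U1": times(hess_diag_G(neg), diff),
        "U2": minus(grad_G(zero), grad_G(neg)),
        "U3": times(hess_diag_G(diff), diff),
        "U4": times(hess_diag_G(e_model), diff),
        "U5": minus(grad_G(diff), grad_G(zero)),
        "U6": minus(grad_G(e_model), grad_G(e_emp)),
    }

# Quadratic choice G(u) = 0.5 * ||u||^2: gradient is u, Hessian is the identity.
def quad_grad(u):
    return list(u)

def quad_hess_diag(u):
    return [1.0] * len(u)
```

Calling `update_directions(quad_grad, quad_hess_diag, e_model, e_emp)` returns the same vector, $E_\theta - \tilde{E}$, under every key.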



















When a twice-differentiable function $G$ satisfies constraints 1 and 2, we have

$$\nabla G(u) = -\nabla G(-u) \qquad (5)$$

$$\nabla G(0) = 0 \qquad (6)$$

$$\nabla^2 G(u) = \nabla^2 G(-u) \qquad (7)$$

where Eq. 7 holds since $\nabla^2 G$ is a diagonal matrix. Given these conditions, we see immediately that $U_1$ is equivalent to $U_3$, and $U_2$ is equivalent to $U_5$. This way, the types of updates are eventually narrowed down to $U_1$ and $U_2$.

To see the relationship between $U_1$ and $U_2$, note that the Taylor expansion of $\nabla G(0)$ at point $\tilde{E} - E_{\theta_t}$ is given by

$$\nabla G(0) = \nabla G(\tilde{E} - E_{\theta_t}) + \nabla^2 G(\tilde{E} - E_{\theta_t})(E_{\theta_t} - \tilde{E}) + O(\|E_{\theta_t} - \tilde{E}\|^2)$$

Therefore $U_1$ and $U_2$ can be regarded as approximations to each other.

Since stochastic optimization is preferred for large-scale training, we replace $\tilde{E}$ and $E_\theta$ with their stochastic estimates $\tilde{E}_t$ and $E_{\theta_t,t}$, where $\tilde{E}_t = \Phi(x_t, y_t)$ and $E_{\theta_t,t} = E_{p_{\theta_t}(y|x_t)}[\Phi(x_t, y)]$. Assuming $G$ satisfies the constraints, we re-write $U_1$ and $U_2$ as

$$\theta_{t+1} = \theta_t - \lambda_t \nabla^2 G(E_{\theta_t,t} - \tilde{E}_t)(E_{\theta_t,t} - \tilde{E}_t) \qquad (U_1^*)$$

$$\theta_{t+1} = \theta_t - \lambda_t \nabla G(E_{\theta_t,t} - \tilde{E}_t) \qquad (U_2^*)$$

which will be the focus for the rest of the paper.

3.4 Choosing the Convex Function G

We now proceed to choose the actual functional forms of $G$, aiming to make the training procedure more efficient.

A naive approach would be to choose $G$ from a parameterized convex function family (say the vector $p$-norm, where $p > 1$ is the hyper-parameter), and tune the hyper-parameter on a held-out set hoping to find a proper value that works best for the task at hand. However, this approach is very inefficient, and we would like to choose $G$ in a more principled way.

Although we derived $U_1^*$ and $U_2^*$ from NGD, they can be treated as SGD of two surrogate loss functions $S_1(\theta)$ and $S_2(\theta)$ respectively (we do not even need to know what the actual functional forms of $S_1$, $S_2$ are), whose gradients $\nabla_\theta S_1 = \nabla^2 G(E_\theta - \tilde{E})(E_\theta - \tilde{E})$ and $\nabla_\theta S_2 = \nabla G(E_\theta - \tilde{E})$ are transformations of the gradient of the log-likelihood (Eq. 1). Since the performance of SGD is sensitive to the condition number of the objective function (Bottou, 2008), one heuristic for the selection of $G$ is to make the condition numbers of $S_1$ and $S_2$ smaller than that of the log-likelihood. However, this is hard to analyze since the condition number is in general difficult to compute. Alternatively, we may select a $G$ so that the second-order information of the log-likelihood can be (approximately) incorporated, as second-order stochastic optimization methods usually converge faster and are insensitive to the condition number of the objective function (Bottou, 2008). This is the guideline we follow in this section.

The first convex function we consider is as follows:

$$G_1(u) = \sum_{i=1}^{d} \left[\frac{u_i}{\sqrt{\epsilon}} \arctan\left(\frac{u_i}{\sqrt{\epsilon}}\right) - \frac{1}{2} \log\left(1 + \frac{u_i^2}{\epsilon}\right)\right]$$

where $\epsilon > 0$ is a small constant free to choose. It can be easily checked that $G_1$ satisfies the constraints imposed in Section 3.3. The gradients of $G_1$ are given by

$$\nabla G_1(u) = \frac{1}{\sqrt{\epsilon}} \left[\arctan\left(\frac{u_1}{\sqrt{\epsilon}}\right), \ldots, \arctan\left(\frac{u_d}{\sqrt{\epsilon}}\right)\right]^T$$

$$\nabla^2 G_1(u) = \mathrm{diag}\left[\frac{1}{u_1^2 + \epsilon}, \ldots, \frac{1}{u_d^2 + \epsilon}\right]$$

In this case, the $U_1^*$ update has the following form (named $U_1^*.G_1$):

$$\theta_{t+1}^i = \theta_t^i - \lambda_t \frac{E_{\theta_t,t}^i - \tilde{E}_t^i}{(E_{\theta_t,t}^i - \tilde{E}_t^i)^2 + \epsilon}$$


where $i = 1, \ldots, d$. This update can be treated as a stochastic second-order optimization procedure, as it scales the gradient (Eq. 1) by the inverse of its variance in each dimension, and it reminds us of the online Newton step (ONS) (Hazan et al., 2007) algorithm, which has a similar update step. However, in contrast to ONS where the full inverse covariance matrix is used, here we only use the diagonal of the covariance matrix to scale the gradient vector. Diagonal matrix approximation is often used in optimization algorithms incorporating second-order information (for example SGD-QN (Bordes et al., 2009), AdaGrad (Duchi et al., 2011), etc.), as it can be computed orders of magnitude faster than using the full matrix without sacrificing much performance.

The $U_2^*$ update corresponding to the choice of $G_1$ has the following form (named $U_2^*.G_1$):

$$\theta_{t+1}^i = \theta_t^i - \lambda_t \arctan\left(\frac{E_{\theta_t,t}^i - \tilde{E}_t^i}{\sqrt{\epsilon}}\right)$$
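The two per-dimension transformations induced by $G_1$ can be sketched as follows. The function names are illustrative, $\epsilon = 0.1$ is only a default used for the example, and the expectations are passed in as plain lists; the constant factor $1/\sqrt{\epsilon}$ of $\nabla G_1$ is left to the learning rate.

```python
import math

def transform_u1(u, eps=0.1):
    """Per-dimension U1*.G1 direction: u / (u^2 + eps)."""
    return u / (u * u + eps)

def transform_u2(u, eps=0.1):
    """Per-dimension U2*.G1 direction: arctan(u / sqrt(eps))."""
    return math.atan(u / math.sqrt(eps))

def apply_update(theta, e_model, e_emp, lr, transform):
    """theta_{t+1}^i = theta_t^i - lr * transform(E_theta^i - E~^i)."""
    return [th - lr * transform(m - e)
            for th, m, e in zip(theta, e_model, e_emp)]
```

For a small gradient component such as `u = 0.1`, both transforms return a value larger than `u` itself, which is the boosting effect the updates rely on.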
























Note that the constant $1/\sqrt{\epsilon}$ in $\nabla G_1$ has been folded into the learning rate $\lambda_t$. Although not apparent at first sight, this update is in some sense similar to $U_1^*.G_1$. From $U_1^*.G_1$ we see that in each dimension, gradients $E_{\theta_t,t}^i - \tilde{E}_t^i$ with smaller absolute values get boosted more dramatically than those with larger values (as long as $|E_{\theta_t,t}^i - \tilde{E}_t^i|$ is not too close to zero). A similar property also holds for $U_2^*.G_1$, since arctan is a sigmoid function, and as long as we choose $\epsilon < 1$ so that

$$\frac{d}{du} \arctan\left(\frac{u}{\sqrt{\epsilon}}\right)\Big|_{u=0} = \frac{1}{\sqrt{\epsilon}} > 1,$$

the magnitude of $E_{\theta_t,t}^i - \tilde{E}_t^i$ with small absolute value will also get boosted. This is illustrated in Figure 1 by comparing the functions $\frac{u}{u^2 + \epsilon}$ (where we select $\epsilon = 0.1$) and $\arctan\left(\frac{u}{\sqrt{\epsilon}}\right)$. Note that for many NLP tasks modeled by CRF, only indicator features are defined. Therefore the value of $E_{\theta_t,t}^i - \tilde{E}_t^i$ is bounded by -1 and 1, and we only care about function values on this interval.

Since arctan belongs to the sigmoid function family, it is not the only candidate whose corresponding update mimics the behavior of $U_1^*.G_1$, while $G$ still satisfies the constraints given in Section 3.3. We therefore consider two more convex functions $G_2$ and $G_3$, whose gradients are the erf and Gudermannian (gd) functions from the sigmoid family respectively:

$$\nabla G_2(\alpha u)_i = \frac{2}{\sqrt{\pi}} \int_0^{\alpha u_i} \exp\{-x^2\}\, dx$$

$$\nabla G_3(\beta u)_i = 2 \arctan(\exp\{\beta u_i\}) - \frac{\pi}{2}$$

where $\alpha, \beta \in \mathbb{R}$ are hyper-parameters. In this case, we do not even know the functional form of the functions $G_2$, $G_3$; however, it can be checked that both of them satisfy the constraints. The reason why we select these two functions from the sigmoid family is that when the gradients of erf and gd are the same as that of arctan at point zero, both of them stay on top of arctan. Therefore, erf and gd are able to give stronger boosts to small gradients. This is also illustrated in Figure 1.

Applying $\nabla G_2$ and $\nabla G_3$ to $U_2^*$, we get two updates named $U_2^*.G_2$ and $U_2^*.G_3$ respectively. We do not consider further applying $\nabla^2 G_2$ and $\nabla^2 G_3$ to $U_1^*$, as the meaning of the updates becomes less clear.

Figure 1: Comparison of functions $u/(u^2 + 0.1)$, arctan, erf, gd. The sigmoid functions arctan, erf and gd are normalized so that they have the same gradient as $u/(u^2 + 0.1)$ at $u = 0$ and their values are bounded by -1, 1.


3.5 Adding Regularization Term 

So far we have not considered adding a regularization term to the loss functions, which is crucial for better generalization. A common choice of the regularization term is the 2-norm (Euclidean metric) of the parameter vector $\frac{C}{2}\|\theta\|^2$, where $C$ is a hyper-parameter specifying the strength of regularization. In our case, however, we derived our algorithms by applying NGD to the Bregman loss functions, which assumes the underlying parameter space to be non-Euclidean. Therefore, the regularization term we add has the form $\frac{C}{2}\theta^T M_\theta \theta$, and the regularized loss to be minimized is

$$\frac{C}{2}\theta^T M_\theta \theta + B_i \qquad (8)$$

where $B_i$, $i \in \{1, 2, 3, 4\}$ are the loss functions $B_1$-$B_4$.

However, the Riemannian metric $M_\theta$ itself is a function of $\theta$, and the gradient of the objective at time $t$ is difficult to compute. Instead, we use an approximation by keeping the Riemannian metric fixed at each time stamp, and the gradient is given by $C M_{\theta_t}\theta_t + \nabla_{\theta_t} B_i$. Now if we apply NGD to this objective, the resulting updates will be no different from the SGD for L2-regularized surrogate loss functions $S_1$, $S_2$:

$$\theta_{t+1} = (1 - C\lambda_t)\left(\theta_t - \frac{\lambda_t}{1 - C\lambda_t} \nabla_\theta S_{i,t}(\theta_t)\right)$$

where $i \in \{1, 2\}$. It is well-known that SGD for L2-regularized loss functions has an equivalent but more efficient sparse update (Shalev-Shwartz et al., 2007) (Bottou, 2012):

$$\bar{\theta}_{t+1} = \bar{\theta}_t - \frac{\lambda_t}{(1 - C\lambda_t) z_t} \nabla_\theta S_{i,t}(\theta_t)$$

$$z_{t+1} = (1 - C\lambda_t) z_t$$

where $z_t$ is a scaling factor with $z_0 = 1$, and $\bar{\theta}_t$ denotes the unscaled parameter vector, so that $\theta_t = z_t \bar{\theta}_t$. We then modify $U_1^*$ and $U_2^*$ accordingly, simply by changing the step-size and maintaining a scaling factor.

As for the choice of learning rate $\lambda_t$, we follow the recommendation given by (Bottou, 2012) and set $\lambda_t = \hat{\lambda}(1 + \hat{\lambda} C t)^{-1}$, where $\hat{\lambda}$ is calibrated from a small subset of the training data before the training starts.

The final versions of the algorithms developed so far are summarized in Figure 2.
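The equivalence between the regularized update and its sparse, scaled form can be checked with a short simulation. This is a sketch under stated assumptions: `toy_gradient` is a hypothetical stand-in for the transformed gradient of a surrogate loss, chosen only so that a single coordinate is active per step, and the constants are arbitrary.

```python
import math

def learning_rate(t, lam_hat, C):
    """lambda_t = lam_hat / (1 + lam_hat * C * t)."""
    return lam_hat / (1.0 + lam_hat * C * t)

def dense_step(theta, grad, lam, C):
    """Regularized update: (1 - C*lam) * (theta - lam / (1 - C*lam) * grad)."""
    return [(1.0 - C * lam) * (th - lam / (1.0 - C * lam) * g)
            for th, g in zip(theta, grad)]

def scaled_step(theta_bar, z, grad, lam, C):
    """Same update on the unscaled vector theta_bar, where theta = z * theta_bar.

    Only coordinates with a non-zero gradient are touched; the shrinkage of
    all other coordinates is absorbed into the scalar z.
    """
    step = lam / ((1.0 - C * lam) * z)
    for j, g in enumerate(grad):
        if g != 0.0:
            theta_bar[j] -= step * g
    return theta_bar, (1.0 - C * lam) * z

def toy_gradient(theta, t):
    """Hypothetical sparse surrogate gradient: one active coordinate per step."""
    j = t % len(theta)
    g = [0.0] * len(theta)
    g[j] = math.atan(theta[j] - 1.0)
    return g
```

The dense form rescales every coordinate at every step, while the scaled form touches only the active coordinates plus one scalar, which is where the efficiency gain comes from when feature vectors are sparse.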


3.6 Computational Complexity 

The proposed algorithms simply transform the gradient of the log-likelihood, and each transformation function can be computed in constant time; therefore all of them have the same time complexity as SGD ($O(d)$ per update). In cases where only indicator features are defined, the training process can be accelerated by pre-computing the values of $\nabla^2 G$ or $\nabla G$ for variables within the range $[-1, 1]$ and keeping them in a table. During the actual training, the transformed gradient values can be found simply by looking up the table after rounding off to the nearest entry. This way we do not even need to compute the function values on the fly, which significantly reduces the computational cost.
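A minimal sketch of the table lookup described above is given here. The table resolution (2001 entries, i.e. a grid spacing of 0.001) and the function names are assumptions made for the example; the text does not specify them.

```python
import math

def build_table(transform, n_entries=2001):
    """Tabulate transform(u) on an evenly spaced grid over [-1, 1]."""
    step = 2.0 / (n_entries - 1)
    return [transform(-1.0 + k * step) for k in range(n_entries)], step

def lookup(table, step, u):
    """Return the tabulated value at the grid point nearest to u."""
    k = int(round((u + 1.0) / step))
    k = min(max(k, 0), len(table) - 1)   # clamp against rounding at the ends
    return table[k]

def arctan_transform(u, eps=0.1):
    return math.atan(u / math.sqrt(eps))
```

The rounding error is bounded by half the grid spacing times the largest slope of the tabulated function, so the resolution can be chosen to match the accuracy required.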


4 Experiments 

4.1 Settings 


We conduct our experiments based on the following settings:

Implementation: We implemented our algorithms based on the CRFsuite toolkit (Okazaki, 2007), in which SGD for the log-likelihood loss with L2 regularization is already implemented. This can be easily done since our algorithm only requires modifying the original gradients. Other parts of the code remain unchanged.

Task: Our experiments are conducted for the widely used CoNLL 2000 chunking shared task (Sang and Buchholz, 2000). The training and test data contain 8939 and 2012 sentences respectively. For fair comparison, we ran our experiments with two standard linear-chain CRF feature sets implemented in CRFsuite. The smaller feature set contains 452,755 features, whereas the larger one contains 7,385,312 features.

Initialization:
  Choose hyper-parameters $C$, $\epsilon$/$\alpha$/$\beta$ (depending on which convex function to use: $G_1$, $G_2$, $G_3$).
  Set $z_0 = 1$, and calibrate $\hat{\lambda}$ on a small training subset.

Algorithm:
  for e = 1 ... num_epoch do
    for t = 1 ... T do
      Receive training sample $(x_t, y_t)$
      $\lambda_t = \hat{\lambda}(1 + \hat{\lambda} C t)^{-1}$, $\theta_t = z_t \bar{\theta}_t$
      Depending on the update strategy selected, update the parameters:
        $\bar{\theta}_{t+1} = \bar{\theta}_t - \frac{\lambda_t}{(1 - C\lambda_t) z_t} \nabla_\theta S_t(\theta_t)$
      where $\nabla_\theta S_t(\theta_t)^i$, $i = 1, \ldots, d$ is given by
        $\frac{E_{\theta_t,t}^i - \tilde{E}_t^i}{(E_{\theta_t,t}^i - \tilde{E}_t^i)^2 + \epsilon}$    ($U_1^*.G_1$)
        $\arctan\left(\frac{E_{\theta_t,t}^i - \tilde{E}_t^i}{\sqrt{\epsilon}}\right)$    ($U_2^*.G_1$)
        $\mathrm{erf}\left(\alpha(E_{\theta_t,t}^i - \tilde{E}_t^i)\right)$    ($U_2^*.G_2$)
        $\mathrm{gd}\left(\beta(E_{\theta_t,t}^i - \tilde{E}_t^i)\right)$    ($U_2^*.G_3$)
      $z_{t+1} = (1 - C\lambda_t) z_t$
      if no more improvement on training set then
        exit
      end if
    end for
  end for

Figure 2: Summary of the proposed algorithms.
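The procedure summarized in Figure 2 could be sketched in Python roughly as follows. This is not the CRFsuite-based implementation used in the experiments: it assumes a toy single-factor model whose candidate labels can be enumerated, omits the early-exit check on training-set improvement, and uses illustrative names (`TRANSFORMS`, `expectation`, `train`) throughout.

```python
import math
import random

def gd(x):
    """Gudermannian function: 2 * arctan(exp(x)) - pi / 2."""
    return 2.0 * math.atan(math.exp(x)) - math.pi / 2.0

# h stands for epsilon, alpha or beta depending on the strategy.
TRANSFORMS = {
    "U1.G1": lambda u, h: u / (u * u + h),
    "U2.G1": lambda u, h: math.atan(u / math.sqrt(h)),
    "U2.G2": lambda u, h: math.erf(h * u),
    "U2.G3": lambda u, h: gd(h * u),
}

def expectation(theta, feats):
    """Model expectation of the feature vector by enumerating candidates."""
    scores = [sum(t * f for t, f in zip(theta, phi)) for phi in feats]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [sum(wi * phi[j] for wi, phi in zip(w, feats)) / total
            for j in range(len(theta))]

def train(data, dim, strategy, hyper, C, lam_hat, num_epoch, seed=0):
    """data: list of (candidate feature vectors, index of the gold candidate)."""
    rng = random.Random(seed)
    transform = TRANSFORMS[strategy]
    theta_bar, z, t = [0.0] * dim, 1.0, 0
    for _ in range(num_epoch):
        order = list(range(len(data)))
        rng.shuffle(order)
        for idx in order:
            feats, gold = data[idx]
            t += 1
            lam = lam_hat / (1.0 + lam_hat * C * t)
            theta = [z * b for b in theta_bar]          # actual parameters
            e_model = expectation(theta, feats)
            step = lam / ((1.0 - C * lam) * z)
            for j in range(dim):
                diff = e_model[j] - feats[gold][j]      # E_theta^i - E~^i
                if diff != 0.0:
                    theta_bar[j] -= step * transform(diff, hyper)
            z *= 1.0 - C * lam                          # lazy L2 shrinkage
    return [z * b for b in theta_bar]
```

Only the line applying `transform` differs from a plain regularized SGD loop, which reflects the point that the proposed updates modify nothing but the gradient.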

Baseline algorithms: We compare the performance of our algorithms summarized in Figure 2 with SGD, L-BFGS, and the Passive-Aggressive (PA) algorithm. Except for L-BFGS, which is a second-order batch-learning algorithm, we randomly shuffled the data and repeated experiments five times.

Hyper-parameter selection: For convex function $G_1$ we choose $\epsilon = 0.1$; correspondingly, the arctan function in $U_2^*.G_1$ is $\arctan(3.16u)$. We have also experimented with the function $\arctan(10u)$ for this update, following the heuristic that the transformation function $\frac{u}{u^2 + 0.1}$ given by $\nabla^2 G_1$ and $\arctan(10u)$ have consistent gradients at zero, since the $U_2^*$ update imitates the behavior of $U_1^*$ (the two choices of the arctan functions are denoted by arctan.1 and arctan.2 respectively). Following the same heuristic, we choose $\alpha = 5\sqrt{\pi} \approx 8.86$ for the erf function and $\beta = 10$ for the gd function.
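The arithmetic behind these hyper-parameter values can be reproduced in a few lines. The sketch below matches the slope at zero of each sigmoid function to that of $u/(u^2 + \epsilon)$; the variable names and the finite-difference helper are illustrative.

```python
import math

EPS = 0.1

def slope_at_zero(f, h=1e-6):
    """Central finite-difference estimate of f'(0)."""
    return (f(h) - f(-h)) / (2.0 * h)

def gd(x):
    """Gudermannian function: 2 * arctan(exp(x)) - pi / 2."""
    return 2.0 * math.atan(math.exp(x)) - math.pi / 2.0

target = 1.0 / EPS                          # slope of u / (u^2 + eps) at u = 0
arctan1_scale = 1.0 / math.sqrt(EPS)        # arctan.1: arctan(u / sqrt(eps))
arctan2_scale = target                      # arctan.2: d/du arctan(c*u) at 0 is c
alpha = target * math.sqrt(math.pi) / 2.0   # d/du erf(a*u) at 0 is 2a / sqrt(pi)
beta = target                               # d/du gd(b*u) at 0 is b
```

With $\epsilon = 0.1$ the target slope is 10, which gives the scale 10 for arctan.2 and gd, and $5\sqrt{\pi}$ for erf.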

4.2 Results 

Comparison with the baseline: We compare the performance of the proposed and baseline algorithms on the training and test sets in Figure 3 and Figure 4, corresponding to the small and large feature sets respectively. The plots show the F-scores on the training and test sets for the first 50 epochs. To keep the plots neat, we only show the average F-scores of repeated experiments after each epoch, and omit the standard deviation error bars. For the $U_2^*.G_1$ update, only the arctan.1 function is reported here. From the figures we observe that:

1. The strongest baseline is given by SGD. By comparison, PA appears to overfit the training data, while L-BFGS converges very slowly, although eventually it catches up.

2. Although the SGD baseline is already very strong (especially with the large feature set), both proposed algorithms $U_1^*.G_1$ and $U_2^*.G_1$ outperform SGD and stay on top of the SGD curves most of the time. On the other hand, the $U_2^*.G_1$ update appears to be a little more advantageous than $U_1^*.G_1$.

Comparison of the sigmoid functions: Since arctan, erf and gd are all sigmoid functions and it is interesting to see how their behaviors differ, we compare the updates $U_2^*.G_1$ (for both arctan.1 and arctan.2), $U_2^*.G_2$ and $U_2^*.G_3$ in Figure 5 and Figure 6. The strongest baseline, SGD, is also included for comparison. From the figures we have the following observations:

1. As expected, the sigmoid functions demonstrated similar behaviors, and the performances of their corresponding updates are almost indistinguishable. Similar to $U_2^*.G_1$, $U_2^*.G_2$ and $U_2^*.G_3$ both outperformed the SGD baseline.

2. The performance of $U_2^*.G_1$ given by the arctan functions is insensitive to the choice of the hyper-parameters. Although we did not run similar experiments for the erf and gd functions, similar properties can be expected from their corresponding updates.
Finally, we report in Table 2 the F-scores on the test set given by all algorithms after they converge.

Figure 3: F-scores of training and test sets given by the baseline and proposed algorithms, using the small feature set.

Figure 4: F-scores of training and test sets given by the baseline and proposed algorithms, using the large feature set.

Figure 5: F-scores of training and test sets given by $U_2^*.G_1$, $U_2^*.G_2$ and $U_2^*.G_3$, using the small feature set.

Figure 6: F-scores of training and test sets given by $U_2^*.G_1$, $U_2^*.G_2$ and $U_2^*.G_3$, using the large feature set.

Algorithm                 F-score % (small)    F-score % (large)
SGD                       95.98±0.02           96.02±0.01
PA                        95.82±0.04           95.90±0.03
L-BFGS                    96.00                96.01
$U_1^*.G_1$               95.99±0.02           96.06±0.03
$U_2^*.G_1$ (arctan.1)    96.02±0.02           96.06±0.02
$U_2^*.G_1$ (arctan.2)    96.03±0.01           96.06±0.03
$U_2^*.G_2$               96.03±0.02           96.06±0.02
$U_2^*.G_3$               96.02±0.02           96.06±0.02

Table 2: F-scores on the test set after the algorithms converge, using the small and large feature sets. The two $U_2^*.G_1$ rows are the updates given by arctan.1 and arctan.2 respectively.

5 Conclusion

We have proposed a novel parameter estimation framework for CRF. By defining loss functions using the Bregman divergences, we are given the opportunity to select convex functions that transform the gradient of the log-likelihood loss, which leads to more effective parameter learning if the function is properly chosen. Minimization of the Bregman loss function is made possible by NGD, thanks to the structure of exponential families. We developed several parameter update strategies which approximately incorporate the second-order information of the log-likelihood, and outperformed baseline algorithms that are already very strong on a popular text chunking task.

Proper choice of the convex functions is critical to the performance of the proposed algorithms, and is an interesting problem that merits further investigation. While we selected the convex functions with the motivation to reduce the types of updates and incorporate approximate second-order information, there are certainly more possible choices, and the performance could be improved via careful theoretical analysis. On the other hand, instead of choosing a convex function a priori, we may rely on some heuristics from the actual data and choose a function tailored for the task at hand.